1. INTRODUCTION

The Malay language is part of the Nusantara group within the Austronesian language family [1]. It is spoken
by 290 million people across the world. It is the national language of Malaysia and is widely
used in both the public and private sectors of the country. In Malaysia, the language adopted the Roman alphabet
during the British administration period [2]. The Malaysian government is actively promoting the country as
a hub for education and medical tourism. In 2018, there were over 127,000 foreign students in Malaysia,
and the number reached 130,000 in 2019. In addition, over one million medical tourists arrived in
Malaysia in 2017, and the number reached 1.3 million in 2020. Suitable machine
translation (MT) is therefore essential to help international students and tourists understand conversations and content
when dealing with the locals [3].

As a natural language processing (NLP) application [4], MT uses
computer software to translate messages from one language into another [5], [6], [7]. The
process involves a source natural language (e.g., English) and a target natural language (e.g., Malay) [6]. It
is an essential process [8] in news translation, movie subtitling, question-answering systems, and chatbots that
understand different languages [9]. Two common state-of-the-art approaches [5] are statistical machine
translation (SMT) and neural machine translation (NMT) [10], [11].

SMT takes a word-to-word approach between the source and target
words, based on statistical analysis of text corpora [12]. A further refinement of the
approach restricts the alignment of each source word to exactly one target word [13], [14]. A similar
approach is used in speech recognition through the hidden Markov model (HMM).

NMT, on the other hand, adopts deep neural networks [15] built from recurrent neural network (RNN),
long short-term memory (LSTM), and gated recurrent unit (GRU) layers. The fundamental unit in NMT is a vector
[16]. NMT depends on a word embedding to transform the word sequence into vectors before model
training can take place [17]. In addition, some work has combined SMT and NMT to
take advantage of the strengths of both models [5]. Examples include Stahlberg et al., who used risk
estimation in NMT [18], and Du and Way's cascade framework for hybrid MT [19].

All MT approaches require a parallel text corpus to train the models, and preparing such a corpus
is an intensive data-driven process. SMT additionally requires a corpus of the target language to
build the language model. Traditionally, SMT performs well on small datasets with long sentences
[20]. It has also demonstrated better performance than NMT when there is a domain mismatch between the
training and testing datasets.

MT from English to other languages was introduced more than 35 years ago [21]. The study of
MT for the Malay language began around 1984 at the Unit Terjemahan Melalui Komputer (UTMK) of Universiti
Sains Malaysia (USM) [22]. The first online English-Malay MT system was introduced in 2002 through a
collaboration between MIMOS and USM, and it was aimed at gist translation [23]. Later, in 2006,
example-based machine translation (EBMT) used bilingual corpus examples to form a proper representation for
the translation [23]. Google Translate is another popular platform for MT [21], [24]. In terms of NMT for
English-Malay translation, very little research has been carried out.

In this manuscript, a rectified linear unit (ReLU) based attention score is proposed to improve
the performance of RNN-based NMT on conversational dialogue in English-Malay translation. Intuitively,
this enhanced attention-based sequence-to-sequence NMT should preserve the long-sequence context
vector and mitigate the vanishing gradient problem that is common in deep networks. Section 2 gives
a brief overview of the sequence-to-sequence model. Section 3 describes the experiment setup.
Section 4 discusses the results and the performance of the models used in the experiment.


2. RELATED WORKS 

The recurrent neural network (RNN) consists of recurrent cells in which the current state of the cell
depends on both the past cell states and the current input through a feedback connection. The RNN unit suffers from two major
problems: exploding gradients and vanishing gradients [25]. These problems make it difficult for the RNN unit
to handle long-term dependencies. In this experiment, RNN-based sequence-to-sequence
(seq2seq) NMT models were compared in terms of performance. The RNN variants are: i) long short-term
memory (LSTM), ii) bidirectional LSTM (Bi-LSTM), and iii) gated recurrent unit (GRU).


2.1. Long short-term memory (LSTM) 

Long short-term memory (LSTM) was proposed by Hochreiter and Schmidhuber [26]. This
RNN-based neural network uses gates to retain information in the cell. The architecture is able to deal
with the long-term dependency issue that affects the RNN. There are three gates in the LSTM: the input gate, the forget
gate, and the output gate. The input gate takes in the previous hidden state and the current input, and decides which values
will be updated using a sigmoid function. The forget gate decides which information from the previous hidden
state and the current input to retain or discard. Lastly, the output gate decides what the next hidden state should
be.
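
The gate description above can be sketched as a single LSTM time step. This is a minimal illustration in plain Python, not the implementation used in the experiment: the `params` layout, the "candidate" term, and the cell-state update follow the common textbook formulation of the LSTM and are assumptions beyond what the text spells out.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, b, z):
    """Compute W @ z + b, with W given as a list of rows."""
    return [sum(w * zi for w, zi in zip(row, z)) + bi for row, bi in zip(W, b)]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step; params[name] = (W, b) acts on h_prev concatenated with x."""
    z = h_prev + x  # list concatenation: previous hidden state, then current input
    i = [sigmoid(v) for v in affine(*params["input"], z)]        # values to update
    f = [sigmoid(v) for v in affine(*params["forget"], z)]       # share of c_prev to keep
    o = [sigmoid(v) for v in affine(*params["output"], z)]       # share of cell exposed as h
    g = [math.tanh(v) for v in affine(*params["candidate"], z)]  # proposed new content
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


# Toy usage (hypothetical sizes): hidden size 2, input size 1, all weights zero,
# so every gate evaluates to 0.5 and the candidate to 0.
zero = ([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [0.0, 0.0])
params = {name: zero for name in ("input", "forget", "output", "candidate")}
h, c = lstm_step([1.0], [0.0, 0.0], [1.0, -2.0], params)
```

With all-zero weights the forget gate keeps half of the previous cell state, which makes the role of each gate easy to trace by hand.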


2.2. Bidirectional LSTM (Bi-LSTM) 

The main idea behind Bi-LSTM is to combine input information from the past and the future of a specific
time step in the LSTM model [27]. This architecture makes more input information available to the network by
allowing it to preserve both past and future information. The implementation consists of a regular RNN unit
with two directions or states: one in the positive time direction, called the forward states, and another
in the negative time direction, called the backward states.
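
The forward and backward states can be illustrated with a small wrapper that runs a recurrent step in both time directions and concatenates the two states at each position. This is a sketch under simplifying assumptions: `run_rnn`, `bidirectional`, and the toy `cumulative` step are hypothetical names, and the toy step (a running sum) stands in for a real LSTM cell only to make the information flow visible.

```python
def run_rnn(step, inputs, h0):
    """Apply step(x, h) over the inputs and collect the state after each element."""
    states, h = [], h0
    for x in inputs:
        h = step(x, h)
        states.append(h)
    return states


def bidirectional(step_fwd, step_bwd, inputs, h0):
    """Concatenate forward states with backward states at each time step."""
    fwd = run_rnn(step_fwd, inputs, h0)
    bwd = run_rnn(step_bwd, inputs[::-1], h0)[::-1]  # run on reversed input, re-align
    return [f + b for f, b in zip(fwd, bwd)]


def cumulative(x, h):
    """Toy recurrent step: the state is the running sum of the inputs seen so far."""
    return [h[0] + x[0]]


outputs = bidirectional(cumulative, cumulative, [[1.0], [2.0], [3.0]], [0.0])
# outputs == [[1.0, 6.0], [3.0, 5.0], [6.0, 3.0]]
```

At each position the first component summarizes the inputs up to that step and the second summarizes the inputs from that step onward, which is the past-plus-future view described above.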


2.3. Gated recurrent unit (GRU) 

The gated recurrent unit (GRU) simplifies the LSTM network by removing the cell state and using
the hidden state to transfer information. The GRU has only two gates, the reset gate and the update gate [28],
which allow it to retain information from many time steps earlier. The update gate determines the amount
of information from past time steps to pass along to the future, while the reset gate decides the
amount of past information to retain.
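
A single GRU step can be sketched in the same style. The formulation below is one common convention (the new state interpolates between the previous state and a candidate state); the `params` layout and names are assumptions for illustration rather than details taken from the experiment.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, b, z):
    """Compute W @ z + b, with W given as a list of rows."""
    return [sum(w * zi for w, zi in zip(row, z)) + bi for row, bi in zip(W, b)]


def gru_step(x, h_prev, params):
    """One GRU step; params[name] = (W, b) acts on a hidden vector concatenated with x."""
    z_in = h_prev + x
    u = [sigmoid(v) for v in affine(*params["update"], z_in)]  # update gate
    r = [sigmoid(v) for v in affine(*params["reset"], z_in)]   # reset gate
    gated = [rk * hk for rk, hk in zip(r, h_prev)] + x         # reset-scaled past state
    cand = [math.tanh(v) for v in affine(*params["candidate"], gated)]
    # The update gate mixes the previous state with the candidate state.
    return [(1.0 - uk) * hk + uk * ck for uk, hk, ck in zip(u, h_prev, cand)]


# Toy usage (hypothetical sizes): with zero weights both gates are 0.5 and the
# candidate is 0, so the new state is half of the previous one.
zero = ([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [0.0, 0.0])
params = {name: zero for name in ("update", "reset", "candidate")}
h_new = gru_step([1.0], [2.0, -4.0], params)
```

Unlike the LSTM sketch, there is no separate cell state here; the hidden state alone carries the information between steps.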


2.4. Sequence-to-sequence (seq2seq) 

The original sequence-to-sequence (seq2seq) model introduced by Sutskever et al. [29] has two
major components: an encoder and a decoder. The encoder consists of a stack of recurrent units, each of which
takes in one element of the input sequence. It collects information about the sequence to form an
internal state vector, also called the context vector, and propagates it forward. The hidden state $h_t$
is computed by (1) using the current input $x_t$, the previous state $h_{t-1}$, and the network weights $W$.


$$h_t = f\left(W^{hh} h_{t-1} + W^{hx} x_t\right) \tag{1}$$


At the other end, the decoder also consists of a stack of recurrent units, each of which predicts an
output at time step $t$. The initial state of the decoder is initialized from the final states of the encoder.
Each recurrent unit accepts a hidden state from the previous unit and computes its own hidden state.
The hidden state $h_t$ of the decoder is computed using (2).


$$h_t = f\left(W^{hh} h_{t-1}\right) \tag{2}$$


Then, the output $y_t$ at time step $t$ is computed using (3), which combines the
hidden state of the current time step with the respective weight $W^{S}$. The softmax function is applied to generate
the output probability vector.


$$y_t = \mathrm{softmax}\left(W^{S} h_t\right) \tag{3}$$


The model of Sutskever et al. [29] achieved a BLEU score of 34.81, which is above the SMT baseline
of 33.30.
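
Equations (1) to (3) can be traced with a toy encoder and decoder. This is a minimal sketch that assumes $f$ is tanh and uses small hand-picked weights; the function names and the weight values are hypothetical and chosen only so that the example runs.

```python
import math


def matvec(W, v):
    """Multiply a list-of-rows matrix W by the vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(v):
    m = max(v)  # subtract the maximum for numerical stability
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]


def encode(xs, W_hh, W_hx, h0):
    """Equation (1): fold the input sequence into the final hidden state."""
    h = h0
    for x in xs:
        h = [math.tanh(a + b) for a, b in zip(matvec(W_hh, h), matvec(W_hx, x))]
    return h


def decode(h, W_hh, W_S, steps):
    """Equations (2) and (3): update the state, then emit a probability vector."""
    outputs = []
    for _ in range(steps):
        h = [math.tanh(a) for a in matvec(W_hh, h)]
        outputs.append(softmax(matvec(W_S, h)))
    return outputs


# Toy usage: hidden size 2, input size 1, output vocabulary of 3 symbols.
W_hh = [[0.5, 0.0], [0.0, 0.5]]
W_hx = [[1.0], [-1.0]]
W_S = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = encode([[0.1], [0.2]], W_hh, W_hx, [0.0, 0.0])
probs = decode(context, W_hh, W_S, steps=3)
```

The decoder here starts from the final encoder state only, which is the single-vector bottleneck that the attention mechanism in the next subsection is designed to relieve.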


2.5. Attention mechanism 

The attention mechanism was first introduced by Bahdanau et al. [10]. It aims to solve the representation
issue in the seq2seq model, where the decoder only receives the last hidden state of the encoder. The attention
mechanism works as part of the network to capture the important parts of the source [30]. It
acts as an interface between the encoder and the decoder, so that the decoder is provided with all of the encoder's
hidden states [31].

The seq2seq model with attention consists of the encoder, decoder, and attention
layers. Within the attention layer, there are three components: the alignment layer, the attention
weights, and the context vector. The alignment layer maps the input at time step $t$ and the output from the previous
time step $t-1$. It is based on the previous hidden state $h_{p-1}$ and the previous state $s_{t-1}$. The alignment score is


$$\tau_{tp} = v_a^{\top} \tanh\left(W^{ss} s_{t-1} + W^{hh} h_{p-1}\right) \tag{4}$$


In this experiment, the hyperbolic tangent (tanh) function is replaced with the ReLU function. Hence,
(4) becomes


$$\tau_{tp} = v_a^{\top}\, \mathrm{ReLU}\left(W^{ss} s_{t-1} + W^{hh} h_{p-1}\right) \tag{5}$$

This adjustment aims to overcome the vanishing gradient issue
that commonly occurs with the tanh alignment score [32], [33].
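
The difference between the tanh and ReLU alignment scores can be seen on a small numeric example. The sketch below is illustrative only: `alignment_score` is a hypothetical helper, the weights are identity matrices chosen for readability, and the inputs are arbitrary.

```python
import math


def matvec(W, v):
    """Multiply a list-of-rows matrix W by the vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def relu(z):
    return max(0.0, z)


def alignment_score(v_a, W_s, W_h, s_prev, h_enc, activation):
    """Additive alignment score: v_a . activation(W_s s_prev + W_h h_enc)."""
    pre = [a + b for a, b in zip(matvec(W_s, s_prev), matvec(W_h, h_enc))]
    return sum(v * activation(z) for v, z in zip(v_a, pre))


identity = [[1.0, 0.0], [0.0, 1.0]]
v_a = [1.0, 1.0]
s_prev, h_enc = [1.0, -3.0], [2.0, 1.0]  # pre-activation is [3.0, -2.0]

tanh_score = alignment_score(v_a, identity, identity, s_prev, h_enc, math.tanh)
relu_score = alignment_score(v_a, identity, identity, s_prev, h_enc, relu)

# Slope of tanh at the pre-activation 3.0; the ReLU slope there is 1.
tanh_slope_at_3 = 1.0 - math.tanh(3.0) ** 2
```

Here tanh squashes each term into (-1, 1), so the score is about 0.03 and its slope at 3.0 is below 0.01, while ReLU passes the positive term through unchanged and gives a score of 3.0. This is only a numeric illustration of the saturation effect that motivates the change.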


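The attention weights $\alpha_{tp}$ are obtained by normalizing the alignment scores with the softmax function, as in (6).


$$\alpha_{tp} = \frac{\exp(\tau_{tp})}{\sum_{p'} \exp(\tau_{tp'})} \tag{6}$$


The context vector $c_t$ is the sum of the encoder hidden states weighted by the attention weights, which depend on the previous state $s_{t-1}$ through the alignment score, as shown in (7).


$$c_t = \sum_{p} \alpha_{tp} h_p \tag{7}$$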
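
The normalization in (6) and the weighted sum in (7) amount to a few lines of code. The sketch below assumes the alignment scores have already been computed; `attention_context` is a hypothetical helper name and the inputs are toy values.

```python
import math


def softmax(v):
    m = max(v)  # subtract the maximum for numerical stability
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]


def attention_context(scores, enc_states):
    """Turn alignment scores into attention weights and a context vector."""
    weights = softmax(scores)
    dim = len(enc_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, enc_states)) for k in range(dim)]
    return weights, context


enc_states = [[1.0, 0.0], [0.0, 1.0]]
flat_w, flat_ctx = attention_context([0.0, 0.0], enc_states)    # equal scores
peak_w, peak_ctx = attention_context([10.0, 0.0], enc_states)   # one dominant score
```

Equal scores give equal weights and a context vector that averages the encoder states, whereas one dominant score concentrates almost all of the weight on a single source position.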


Hence, the decoder generates the next target hidden state by accepting the previous output
$y_{t-1}$ and the source context vector $c_t$ as input, as shown in (8).


$$s_t = f\left(W^{ss} s_{t-1} + W^{sy} y_{t-1} + W^{sc} c_t\right) \tag{8}$$


The $j$-th decoder target hidden state requires the previous hidden state as in (9).

$$t_j = f\left(W^{ts} s_j + W^{ty} y_{j-1} + W^{tc} c_j\right) \tag{9}$$

Finally, the output word is produced from the probability distribution $P_j$, which is obtained with the softmax function using (10).


$$P_j = \mathrm{softmax}\left(W^{S} t_j\right) \tag{10}$$


3. METHODOLOGY 
3.1. The English-Malay parallel text corpus 

In this experiment, an English-Malay parallel corpus was collected. The compiled corpus consists
of parallel text for model training and testing. The parallel texts were extracted from the following
sources: i) bilingual sentence pairs from ManyThings.org; ii) bilingual subtitles of local Malay movies; and iii)
a translated English-Malay bilingual translation corpus [34].

None of these corpora was in a ready-to-use form. Hence, some pre-processing was required to compile them into
a single corpus of bilingual sentence pairs [6]. In this study, pre-processing is also required to allow better processing
by the algorithm [35]. The pre-processing steps are:

- Data loading: This step loads all the data from the different sources, in comma-delimited (CSV) text
files and JSON format, into a single CSV file.

- Lowercasing: This step converts the text to lowercase to prevent case variation in the
text and the resulting sparsity issue.

- Punctuation, symbol, and non-text character removal: All non-text characters in the data
are removed so that the language model is trained only on text-based tokens.

- Word tokenization: This step splits the text into word tokens before feeding them into the model
for training.
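
The lowercasing, character removal, and tokenization steps can be sketched as follows. This is a minimal illustration, not the pipeline used in the experiment: the regular expression, the decision to keep digits, and whitespace tokenization are assumptions, and the data loading step is omitted.

```python
import re


def preprocess(sentence):
    """Lowercase, strip punctuation and symbols, and split into word tokens."""
    lowered = sentence.lower()
    # Assumption: keep only Latin letters, digits, and whitespace.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", lowered)
    return cleaned.split()  # whitespace tokenization


def preprocess_pair(english, malay):
    """Apply the same cleaning to both sides of a bilingual sentence pair."""
    return preprocess(english), preprocess(malay)


pair = preprocess_pair("France is in Western Europe!", "Perancis adalah di Eropah Barat.")
# pair == (['france', 'is', 'in', 'western', 'europe'],
#          ['perancis', 'adalah', 'di', 'eropah', 'barat'])
```

Replacing removed characters with a space rather than deleting them keeps words that were joined by punctuation from merging into a single token.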


3.2. Evaluation 

The bilingual evaluation understudy (BLEU) score was used in this experiment to evaluate the
quality of the translation. This score compares the translated text with the reference translation
[36]. The evaluation involves matching the n-grams in the candidate translation with the n-grams in the reference text. This
evaluation metric has the following advantages: i) it is quick and simple to calculate, ii) it is language independent,
iii) it correlates highly with human evaluation, and iv) it is widely adopted for evaluating NMT.
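
The n-gram matching behind BLEU can be sketched for a single sentence pair. The code below is a simplified illustration that assumes one reference, n-grams up to length four, no smoothing, and a 0 to 100 scale; it is not the scoring tool used in the experiment, and the function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with clipped n-gram precision and a brevity penalty."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(overlap / total))
    if len(candidate) >= len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1.0 - len(reference) / len(candidate))
    return 100.0 * brevity * math.exp(sum(log_precisions) / max_n)


reference = ["perancis", "adalah", "di", "eropah", "barat"]
exact = bleu(reference, reference)       # identical translation
shorter = bleu(reference[:4], reference)  # last word missing
too_short = bleu(reference[:3], reference)  # no 4-gram can be formed
```

An identical translation scores 100, dropping the last word is penalized only through the brevity penalty, and a three-word candidate scores 0 because it contains no 4-gram.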

In this experiment, five groups of models were trained, and the models' BLEU scores were computed. These
models are: i) vanilla LSTM seq2seq, ii) LSTM seq2seq with attention mechanism using tanh alignment and
ReLU alignment, iii) GRU seq2seq with attention mechanism using tanh alignment and ReLU alignment, iv)
bidirectional LSTM seq2seq with attention mechanism using tanh alignment and ReLU alignment, and v)
bidirectional GRU seq2seq with attention mechanism using tanh alignment and ReLU alignment. Since each
attention model was trained with both alignments, nine models were trained in total.

Early stopping was used during model training to prevent
overfitting. The mechanism monitors the validation loss to determine when to stop
training.
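
The stopping rule can be sketched as a patience counter on the validation loss. The `patience` and `min_delta` values below are hypothetical, since the text does not report the settings used, and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best, wait = loss, 0  # improvement: remember it and reset the counter
        else:
            wait += 1             # no improvement in this epoch
            if wait >= patience:
                return epoch
    return len(val_losses)        # never triggered: training ran to the end


losses = [1.00, 0.80, 0.70, 0.71, 0.72, 0.73, 0.60]
stop_epoch = early_stopping_epoch(losses, patience=3)
# stop_epoch == 6
```

Training stops at epoch 6 because epochs 4 to 6 fail to improve on the best loss of 0.70, so the later improvement at epoch 7 is never reached; a larger patience trades longer training for robustness to such plateaus. In TensorFlow, the `tf.keras.callbacks.EarlyStopping` callback with `monitor="val_loss"` provides comparable behaviour.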


4. RESULT AND DISCUSSION 

In this experiment, all the models were set up and configured using Google TensorFlow-GPU 2.2.
The parallel corpus used for training consists of 189,000 bilingual English-Malay sentence pairs. The
testing dataset consists of 199 bilingual sentence pairs. The source and target vocabularies contain
8,183 and 6,938 words, respectively. An out-of-vocabulary (OOV) token was used to substitute words
that did not exist in the embedding. Early stopping was applied during model training. The output of all the models
was evaluated using the BLEU score, which in its standard form matches n-grams of up to four words. Hence, each reference text in the dataset must consist of at least 4
words.
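
The OOV substitution can be sketched with a small vocabulary lookup. The token string `<unk>`, the index assignment, and the function names are assumptions for illustration; the text does not specify how the OOV token was represented.

```python
def build_vocab(tokenized_sentences, oov_token="<unk>"):
    """Assign an integer index to every training word; index 0 is the OOV token."""
    vocab = {oov_token: 0}
    for tokens in tokenized_sentences:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(tokens, vocab, oov_token="<unk>"):
    """Map tokens to indices, substituting the OOV index for unseen words."""
    return [vocab.get(token, vocab[oov_token]) for token in tokens]


vocab = build_vocab([["kami", "mempunyai", "masa"]])
ids = encode(["kami", "suka", "masa"], vocab)
# ids == [1, 0, 3]; "suka" is not in the vocabulary and maps to the OOV index 0
```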

A vanilla LSTM seq2seq model was used as the baseline. In this model,
both the encoder and the decoder had a 300-dimension embedding and a single hidden LSTM layer
with 512 neurons. During training, this model stopped at epoch 44 and achieved a BLEU score of
80.39. For comparison, the same test dataset was submitted to Lingvanex.com for translation, and the resulting translation scored
62.04.

Next, four different seq2seq models were set up and trained. These models incorporated the
Bahdanau attention mechanism [10]. Table 1 shows the training epochs for all the models. Generally, the
models converged faster when the attention mechanism was incorporated, compared to
the vanilla model. All these models achieved a validation loss below 0.36 and converged between
epoch 24 and epoch 38. Among them, the bidirectional models, Bi-LSTM and bidirectional GRU
(Bi-GRU), took between 24 and 27 epochs to converge, which is at least about 39% fewer epochs than the 44 epochs of the vanilla model.
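
As a quick check of the reduction relative to the 44 epochs of the vanilla model in Table 1, the slowest (27 epochs) and fastest (24 epochs) bidirectional runs work out as follows:

```latex
\[
\frac{44 - 27}{44} \approx 38.6\%, \qquad \frac{44 - 24}{44} \approx 45.5\%
\]
```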


Table 1. Training epochs and duration for the models

Model                                 No. of epochs   Duration for each epoch
Vanilla LSTM model (baseline model)   44              106 s (100 ms/step)
LSTM Tanh alignment                   31              143 s (134 ms/step)
LSTM ReLU alignment                   37              144 s (135 ms/step)
GRU Tanh alignment                    37              135 s (126 ms/step)
GRU ReLU alignment                    38              133 s (124 ms/step)
BiLSTM Tanh alignment                 24              247 s (232 ms/step)
BiLSTM ReLU alignment                 24              250 s (235 ms/step)
BiGRU Tanh alignment                  27              234 s (219 ms/step)
BiGRU ReLU alignment                  25              233 s (218 ms/step)


Table 2 shows the BLEU scores of the various models. From the experiment, all the attention models achieved
BLEU scores between about 0.9 and 4.75 higher than the baseline model. Among the models, Bi-LSTM
with ReLU attention alignment achieved the highest BLEU score of 85.14, which is 4.75 higher than the
vanilla model. Among the ReLU models, this is followed by the Bi-GRU model with a BLEU score of 83.74,
which is 3.35 above the vanilla model. Generally, the models with ReLU attention alignment achieved
higher BLEU scores than those with tanh attention alignment, with gains ranging from about 0.3 in the LSTM model to 1.12 in the Bi-LSTM
model.

Tables 3 to 6 show the attention weights of translation samples from the Bi-LSTM model with tanh and
ReLU attention alignment. Based on the samples, the ReLU attention alignment model generally has higher
weights than the tanh attention alignment model. In addition, the weights are aligned closely with the
intended output words. Moreover, the attention weights under ReLU alignment are
concentrated more strongly on the relevant input token than on the other tokens in the sequence.


Table 2. BLEU score for testing result of seq2seq models with attention mechanism

          Attention alignment
Model     Tanh      ReLU
LSTM      83.38     83.65
GRU       81.28     81.63
BiLSTM    84.02     85.14
BiGRU     83.45     83.74
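
The gains quoted in the discussion can be recomputed directly from Table 2 and the baseline BLEU score of 80.39. The short script below is only a convenience for checking the arithmetic; the variable names are illustrative.

```python
baseline = 80.39  # BLEU score of the vanilla LSTM seq2seq model

bleu_scores = {
    "LSTM": {"tanh": 83.38, "relu": 83.65},
    "GRU": {"tanh": 81.28, "relu": 81.63},
    "BiLSTM": {"tanh": 84.02, "relu": 85.14},
    "BiGRU": {"tanh": 83.45, "relu": 83.74},
}

# Gain of every attention model over the baseline.
gain_over_baseline = {
    (model, alignment): round(score - baseline, 2)
    for model, scores in bleu_scores.items()
    for alignment, score in scores.items()
}

# Gain of ReLU alignment over tanh alignment within each architecture.
relu_minus_tanh = {
    model: round(scores["relu"] - scores["tanh"], 2)
    for model, scores in bleu_scores.items()
}
# relu_minus_tanh == {'LSTM': 0.27, 'GRU': 0.35, 'BiLSTM': 1.12, 'BiGRU': 0.29}
```

The smallest gain over the baseline is 0.89 (GRU with tanh alignment) and the largest is 4.75 (Bi-LSTM with ReLU alignment).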


Table 3. Attention weights for Bi-LSTM with Tanh attention alignment for sample result 1

           perancis   adalah     di         eropah     barat
france     9.97E-01   1.40E-03   5.79E-05   1.24E-04   7.20E-05   1.91E-03
is         9.04E-04   1.27E-01   2.17E-03   1.65E-04   8.54E-04   1.51E-01
in         9.02E-04   7.86E-01   8.44E-01   4.64E-03   1.48E-03   1.36E-01
western    4.72E-04   2.15E-02   1.31E-03   1.01E-02   9.88E-01   1.13E-01
europe     2.79E-04   1.33E-02   1.52E-01   9.84E-01   4.15E-03   3.61E-02


Table 4. Attention weights for Bi-LSTM with ReLU attention alignment for sample result 1

           perancis   adalah     di         eropah     barat
france     1.00E+00   3.55E-03   1.27E-06   5.01E-09   1.65E-07   1.74E-03
is         7.44E-05   3.49E-01   2.48E-05   4.97E-08   3.69E-05   2.62E-02
in         4.36E-05   6.30E-01   9.88E-01   8.70E-05   2.26E-03   2.25E-02
western    9.51E-08   3.99E-03   1.20E-06   2.14E-04   9.29E-01   2.94E-03
europe     1.91E-06   5.88E-03   1.18E-02   1.00E+00   3.90E-02   8.85E-03


Table 5. Attention weights for Bi-LSTM with Tanh attention alignment for sample result 2

           kami       mempunyai  masa       yang       baik
we         8.90E-01   1.45E-03   2.43E-04   1.34E-03   1.30E-03   4.03E-02
are        5.88E-02   9.35E-03   1.35E-04   1.32E-03   5.36E-04   2.76E-02
having     3.33E-02   8.70E-01   1.53E-02   3.08E-02   6.19E-03   1.45E-01
good       2.64E-03   1.53E-03   6.80E-04   9.41E-01   9.78E-01   8.51E-02
time       5.18E-03   1.14E-01   9.72E-01   1.96E-02   1.36E-02   1.49E-01


Table 6. Attention weights for Bi-LSTM with ReLU attention alignment for sample result 2

           kami       mempunyai  masa       yang       baik
we         8.19E-01   1.34E-04   5.19E-08   6.24E-05   3.54E-06   3.35E-04
are        6.48E-02   1.34E-03   5.67E-06   2.67E-04   4.77E-06   3.80E-04
having     6.59E-02   9.95E-01   8.52E-02   2.22E-01   1.20E-03   2.78E-02
good       1.16E-03   3.40E-06   2.40E-07   7.07E-01   9.99E-01   1.38E-03
time       5.84E-03   3.53E-03   9.15E-01   6.46E-02   1.12E-04   4.23E-02


5. CONCLUSION
In this paper, we empirically evaluated different seq2seq models based on the attention alignment
for neural machine translation from English to Malay. The evaluation focused on the task of sequence
modelling using an English-Malay bilingual parallel text corpus. As very limited work has been done using neural
machine translation in this area, this paper focuses on the use of ReLU attention alignment to improve
translation performance. Generally, the Bi-LSTM and Bi-GRU models with ReLU attention alignment achieve higher BLEU scores
than with the original tanh alignment score, as confirmed by the results.